Multimodal Clustering with Role Induced Constraints for Speaker Diarization
Speaker clustering is an essential step in conventional speaker diarization
systems and is typically addressed as an audio-only speech processing task. The
language used by the participants in a conversation, however, carries
additional information that can help improve the clustering performance. This
is especially true in conversational interactions, such as business meetings,
interviews, and lectures, where specific roles assumed by interlocutors
(manager, client, teacher, etc.) are often associated with distinguishable
linguistic patterns. In this paper we propose to employ a supervised text-based
model to extract speaker roles and then use this information to guide an
audio-based spectral clustering step by imposing must-link and cannot-link
constraints between segments. The proposed method is applied to two different
domains, namely medical interactions and podcast episodes, and is shown to
yield improved results compared to the audio-only approach.

Comment: Submitted at Interspeech 202
Generating Labels for Regression of Subjective Constructs using Triplet Embeddings
Human annotations serve an important role in computational models where the
target constructs under study are hidden, such as dimensions of affect. This is
especially relevant in machine learning, where subjective labels derived from
related observable signals (e.g., audio, video, text) are needed to support
model training and testing. Current research trends focus on correcting
artifacts and biases introduced by annotators during the annotation process
while fusing them into a single annotation. In this work, we propose a novel
annotation approach using triplet embeddings. By lifting the absolute
annotation process to relative annotations where the annotator compares
individual target constructs in triplets, we leverage the accuracy of
comparisons over absolute ratings by human annotators. We then build a
1-dimensional embedding in Euclidean space that is indexed in time and serves
as a label for regression. In this setting, the annotation fusion occurs
naturally as a union of sets of sampled triplet comparisons among different
annotators. We show that by using our proposed sampling method to find an
embedding, we are able to accurately represent synthetic hidden constructs in
time under noisy sampling conditions. We further validate this approach using
human annotations collected from Mechanical Turk and show that we can recover
the underlying structure of the hidden construct up to bias and scaling
factors.

Comment: 9 pages, 5 figures, accepted journal paper